In this study we predict which water pumps throughout Tanzania are functional and which do not work at all, based on a number of variables. A good understanding of which waterpoints will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania. To achieve this we have at our disposal a dataset from DrivenData1 containing 59,400 unique records and 41 different features. The original source is the “Taarifa waterpoints dashboard”2 3 4, which aggregates data from the Tanzania Ministry of Water.
The available data can be broken down into three classes: what kind of pump is operating, when it was installed, and how it is managed. The dataset also has a ternary target feature: which pumps are functional, which need some repairs, and which don’t work at all. We found that many of the features are either useless for predicting whether the pump is functional, or are filled with missing values. There are also groups of three variables that are nested within each other, repeating most of the same information in a hierarchical manner.
Before we found this dataset we considered a dataset from Kaggle, on whether a subscription user of a certain music platform would churn or not5. We found three problems with this dataset and decided it was not optimal for this project. First, all variables were coded in ways that we could not contextualize: to ensure the privacy of their consumers, all data was scrambled, and it was impossible for us to produce a better analysis without more concrete data. The second and third problems are related to the size and structure of the dataset. The size was unmanageable: with over 32 GB of uncompressed data our laptops struggled to import it, and even when we achieved a successful import it was not with the familiar data.frame. On top of this, the structure of the data was burdensome as well. There were 3 different datasets, each with different data and structure. The first one held the information of every client and was reasonable enough. The second one recorded, for each month, the method the client used to pay for the subscription. The third one recorded, for each day, the number of songs they listened to, the number of songs they listened to from start to finish, the number of songs they listened to only 50% of, etc. We had in mind joining all three tables and using our usual techniques and algorithms, but how would we join them? The fundamental part of this problem was predicting whether a user would churn depending on the amount of songs they listened to, to see if a pattern emerged from the user. If we joined the tables in the way we are used to, we would have left most of the information unused.
The following study is broken down into three parts. The first is a descriptive processing of the data, in which we report the most relevant features related to the target feature and correct any mistakes, ending up with a clean dataset which we use in the following parts. The second part consists of exploration techniques, PCA and clustering; since we only have a few numerical variables, we will not perform a regular PCA but an MCA. Finally, the third part consists of the predictive algorithms we deemed optimal for our dataset, reporting the predictive quality of the ones that returned the best results.
Initially our target variable status_group, which recorded the status of the well in three categories, contained the level functional needs repair, which we removed because the interest of the current study is to determine whether a well is or isn’t functional, a binary target. Besides the study’s aim, the deleted level was scarce, and its removal makes the response variable almost perfectly balanced. After removing these values we ended up with 57,588 observations. Second, we also removed the observations with missing values at latitude (and consequently at longitude); more information on this later, but we did it now to better select the sample.
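The two filtering steps above could be sketched in R as follows (`df` is an assumed name for the raw data frame; the level and column names are those of the original dataset):

```r
# Drop the scarce middle level and keep a binary target
df <- subset(df, status_group != "functional needs repair")
df$status_group <- droplevels(df$status_group)

# Drop rows whose coordinates carry the missing-value codes
# discussed later: longitude 0 and latitude -2e-08
df <- subset(df, longitude != 0 & latitude != -2e-08)
```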
After first plowing through with all the observations, we ran into problems: the computational cost of some algorithms, such as random forest, was too high, so we decided to reduce the database’s size. We took a random sample of only 20,000 rows, a reduction of 66% from the original size. Before sampling we checked whether there was an intrinsic structure to the data, as would be the case with temporal data or any other ordering; none was found, so we proceeded to create the sample. It was important to check that we did not break the original structure of the data, as we wanted the sample to correctly represent the whole.
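A minimal sketch of this subsampling step (the seed is an assumption; the report does not state the one actually used):

```r
set.seed(1234)  # assumed seed, for reproducibility
df <- df[sample(nrow(df), 20000), ]
```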
In the following section we describe how we preprocessed the data. Only the most relevant plots and statistics are shown here; any plot not presented in this section can be found in the annex.
Our preprocessing basically consists of the steps described below.
After a brief examination and solution proposal, we decided to remove certain variables from the dataset. Overall, variables are removed because of:
Containing too many levels: Id, wpt_name, subvillage.
Being contained in other variables (_type, _class, _group): these variables have a hierarchical structure, each level containing the same information as the previous one, sometimes adding a bit more complexity that is not needed: extraction_type_group, extraction_type_class, payment_type, quality_group, quantity_group, source_type, waterpoint_type_group.
Not containing relevant information: these either duplicate information collected in other similar variables or all their values are the same: region_code, num_private, recorded_by.
| Variable | Distinct values |
|---|---|
| id | 20000 |
| wpt_name | 14167 |
| num_private | 48 |
| subvillage | 10068 |
| region_code | 26 |
| recorded_by | 1 |
| extraction_type_group | 13 |
| extraction_type_class | 7 |
| payment_type | 7 |
| quality_group | 6 |
| quantity_group | 5 |
| source_type | 7 |
| waterpoint_type_group | 6 |
From the remaining 28 variables, some still have to be processed or reclassified, since there are different ways of encoding missing values in our dataset:
NA codified by ‘0’: longitude, construction_year
NA codified by -2e-08: latitude
NA codified by unknown: management, management_group, payment, water_quality, quantity, source, source_class
NA codified by an empty string: public_meeting, scheme_management (which also has a None level that we decided to treat as missing), scheme_name, permit
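These recodings could be sketched as follows (`df` is an assumed name for the data frame; the column lists match the ones above):

```r
# Recode the different missing-value sentinels to proper NAs
df$longitude[df$longitude == 0]                 <- NA
df$construction_year[df$construction_year == 0] <- NA
df$latitude[df$latitude == -2e-08]              <- NA

unknown_cols <- c("management", "management_group", "payment",
                  "water_quality", "quantity", "source", "source_class")
for (v in unknown_cols) df[[v]][df[[v]] == "unknown"] <- NA

empty_cols <- c("public_meeting", "scheme_management", "scheme_name", "permit")
for (v in empty_cols) df[[v]][df[[v]] == ""] <- NA
df$scheme_management[df$scheme_management == "None"] <- NA
```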
How do we know to remove those values from latitude and longitude? They are trickier to spot, as we need both values together to know whether they make sense. To view them correctly we can use a map: since we know all the values are well pumps in Tanzania, we can safely say that the few values at \((0,0)\) are missing values. The same can be said about \(-2e-08\), which must have been a zero that, as a floating-point number, got messed up in translation.
Latitude and Longitude
Again, since we need both of these values to be able to infer any relevant information, we categorize them into quadrants. Since we know that the country in question is Tanzania, we can easily find that its latitude is comprised between \((0, -12)\) and its longitude between \((29,41)\)6.
Quadrants of the well pumps
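A minimal sketch of such a quadrant categorization, assuming `df` holds the cleaned coordinates (the 3 × 3 grid size is an illustrative assumption, not necessarily the one used):

```r
# Split Tanzania's bounding box into a grid and label each well by its cell
lat_breaks <- seq(-12, 0, length.out = 4)
lon_breaks <- seq(29, 41, length.out = 4)
df$quadrant <- interaction(
  cut(df$latitude,  lat_breaks, include.lowest = TRUE),
  cut(df$longitude, lon_breaks, include.lowest = TRUE)
)
```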
There are only two variables in the dataset that are incorrectly declared. The first one is date_recorded, whose format we have to change from factor to Date. The second one is district_code, which is declared as an integer but should be a factor, as it is not a numerical variable even though it is coded as one.
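These two conversions could be done as follows (the `"%Y-%m-%d"` date format is an assumption about how the dates are stored):

```r
# Fix the two incorrectly declared variables
df$date_recorded <- as.Date(as.character(df$date_recorded), format = "%Y-%m-%d")
df$district_code <- as.factor(df$district_code)
```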
Some variables still have to be processed or reclassified with a more complex procedure. These variables are:
| Variable |
|---|
| amount_tsh |
| gps_height |
| installer |
| population |
| scheme_name |
amount_tsh: Total static head (amount of water available to the waterpoint). It seems the variable might have been wrongly collected: we have a large amount of zeros and then some huge values. It might be because different unit types were used, but our knowledge of water-pump physics is lacking, so we don’t really know how to interpret it. We thought of a few options:
We decided to choose the first option: we create a new variable with 4 different categories. Summary of the new variable:
| Interval | Count |
|---|---|
| (-1,1] | 13929 |
| (1,100] | 2577 |
| (100,5e+03] | 3312 |
| (5e+03,1e+07] | 182 |
amount_tsh categorized into 4 different categories
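The categorization above can be reproduced with `cut`, using the interval breaks shown in the table (`amount_tsh_cat` is an assumed name for the new variable):

```r
# Categorize amount_tsh into the four intervals shown above
df$amount_tsh_cat <- cut(df$amount_tsh,
                         breaks = c(-1, 1, 100, 5e3, 1e7),
                         include.lowest = TRUE)
table(df$amount_tsh_cat)
```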
After seeing how the CART split on this variable when classifying status_group, we decided to also show the bivariate plot. We can see that pumps with higher pressure tend to work better than those without.
Distribution of amount_tsh by target variable
gps_height: Altitude of the well. Looking at altitude maps of Tanzania7, there aren’t many places with zero elevation or below sea level. What we think probably happened is that some people recorded the elevation of the site and some the depth of the well.
Distribution of gps_height; we can clearly see two distributions, but neither corresponds to the target variable
Options:
We decided to go with option 2: impute the real height using the given latitude and longitude. While the first option was quite appealing, we just didn’t have enough data; only 567 observations, roughly 2.8% of the sample, is not enough. It might have been interesting if the wells with negative values all belonged to one of the target classes, but this was not the case. We imputed the data with the geonames package, which is basically a call to the geonames API8. For most of the values we used the srtm3 (Shuttle Radar Topography Mission) elevation data, since it was the method that gave the most consistent results. However, for the values where this method returned -32768, which are ocean areas masked as “no data”9, we used the GTOPO30 method, which gave different results for these values. For the values where the GTOPO30 function also returned an ocean-area code, we manually imputed a value of zero.
Distribution of positive values, ‘Elevation’
Distribution of ‘Elevation’
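A sketch of the fallback logic described above, using the geonames package (a geonames API username is required; the `-9000` threshold for the GTOPO30 ocean code is an assumption for illustration):

```r
library(geonames)
options(geonamesUsername = "your_username")  # assumed account name

# Query elevation for one well; fall back to GTOPO30 when SRTM3 returns the
# -32768 ocean mask, and impute zero if GTOPO30 is also masked
get_elevation <- function(lat, lon) {
  elev <- GNsrtm3(lat, lon)$srtm3
  if (elev == -32768) elev <- GNgtopo30(lat, lon)$gtopo30
  if (elev < -9000) elev <- 0  # still masked as ocean: impute zero
  elev
}
```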
population: Population around the well. With 6,834 records that have a population of zero, we considered categorizing this variable as well; we also find that the number of records with a population of exactly 1 is 2,474, which doesn’t make much sense looking at the graph. We might expect that, with a working water well nearby, people would move closer to it to benefit from it.
Distribution of population according to the target variable
| Quantile | population |
|---|---|
| 0% | 0 |
| 25% | 0 |
| 50% | 30 |
| 75% | 225 |
| 100% | 10000 |
Options:
We considered categorizing it because this is a feature that we think could be interesting for later analysis.
| Interval | Count |
|---|---|
| (-0.5,0] | 6834 |
| (0,1] | 2474 |
| (1,100] | 3110 |
| (100,215] | 2537 |
| (215,2e+03] | 4867 |
| (2e+03,3.05e+04] | 178 |
installer: Organization that installed the well. With 2,146 different levels, we find very similar names among them.
Even though there are over 2000 levels, the installer could be a relevant variable. Therefore it will be kept but might be ignored in further stages of this study. Before using it, we will try to preprocess it in order to decrease the number of levels.
In the previous code the names were lowercased, and as a result the number of levels decreased from 2,146. Also, the number of levels with a frequency of 1 went from 635 to 571.
Finally, a conceptual level grouping is performed, but only for levels with a frequency over one hundred. These are mostly typos or alternative spellings, for example unisef \(\rightarrow\) unicef, oxfarm \(\rightarrow\) oxfam, among many others.
The final grouping considers as “other” all levels appearing fewer than 10 times throughout the database.
After all this, the variable has 188 levels. While we wanted to use fuzzy matching for this feature, it was simply not feasible: there are too many levels with similar names that should nonetheless remain separate. It ended up merging values that were not supposed to be joined before merging the ones that were.
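The clean-up above could be sketched as follows (`fct_lump_min` from the forcats package is an assumed implementation of the rare-level grouping; the typo fixes shown are the two examples from the text):

```r
library(forcats)

# Lowercase, fix known typos, then lump levels appearing < 10 times
inst <- tolower(trimws(as.character(df$installer)))
inst[inst == "unisef"] <- "unicef"
inst[inst == "oxfarm"] <- "oxfam"
df$installer <- fct_lump_min(factor(inst), min = 10, other_level = "other")
```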
scheme_name: Who operates the waterpoint. Similarly to installer, we group all the levels appearing fewer than 10 times into an “other” category.
We reduced from 2697 to only 235 levels.
Finally we join a few levels in other features:
water_quality: We have the levels fluoride abandoned and salty abandoned, which we joined into their corresponding levels fluoride and salty, as “abandoned” is an adjective describing something other than what the feature is trying to depict.
waterpoint_type: We joined the level dam into other, as we only had 3 observations of dam.
The remaining character variables are all converted to factor where appropriate, and we apply droplevels to eliminate any residual empty levels.
Why haven’t we done the fuzzy match? As stated earlier, we tried, but it merged similar names that were not supposed to be joined before merging the ones that were. To do it correctly we would need the help of an expert in the local language, able to discern which names are the same and which are typos. We also considered joining all the levels with few observations into a lumped other category (like we did with water_quality); however, doing so distorted the original data distribution: the missing-value imputations were far worse than without lumping, and the same happened with the models.
In order not to bias the analysis, we remove the rows with missing values in either latitude or longitude. We can afford to do this because the amount of missing data is very small relative to the size of the dataset. For the rest of the missing values we will use MICE; however, we will not impute the categorical variables, because we feel that a missing value is actually more useful to us: a missing value in a category may be missing for a reason, and not knowing might help the models find better splits. After this first draft we might try to impute the missing values of some of the more relevant categories. For now, though, we only impute the numerical variable construction_year, because it is the only one that has missing values.
First let’s check the amount of missing values in our dataset. We shall remove these variables to help the MICE function. We did not actually run MICE first because, with variables that have many levels, we ran into combinatorial problems and the computational time was too high to do anything. To find out which variables to use for MICE, we first used the aregImpute function from the Hmisc package; playing with it, we found the optimal set of variables to use, which also gave us a first estimate of the missing values.
Using the same variables as for aregImpute, we finally ran MICE, although it was probably unnecessary, as the differences between the two are negligible. Looking at the following plots we can see that MICE did a pretty good job.
Density of the imputated values vs. the real values
Density of the imputated values vs. the real values by the target feature
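The MICE step could be sketched as follows (the exact predictor set chosen with aregImpute is not listed in the text, so `predictors` is an assumed placeholder, as is the seed):

```r
library(mice)

# Impute construction_year with predictive mean matching, using a reduced
# set of predictors (placeholder list; the real set came from aregImpute)
predictors <- c("gps_height", "population", "district_code", "extraction_type")
imp <- mice(df[, c("construction_year", predictors)],
            method = "pmm", m = 5, seed = 1234)
df$construction_year <- complete(imp)$construction_year
```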
In this section we will look at some of the more relevant features that relate to the target variable.
Our target variable has three different modalities. We could group functional needs repair with non functional, as both need repair, and finding out which ones need repair is the objective of this project; for now, let’s keep it this way. We see that the response is not balanced (58.5% vs. 41.5%).
Unfortunately, we do not see any clear relationship between latitude or longitude and whether the pumps work or not, nor for any of the other variables; all the bivariate plots can be found in the annex.